Information resource taxonomy

ABSTRACT

An information resource taxonomy system, including a data collector for collecting information resources from a communications network; and a taxonomy generator for generating a taxonomy represented by a hierarchy of resource clusters, using cluster criteria generated from the collected resources. The system includes an editor for editing the criteria, and a renderer for generating linked document data for displaying the hierarchy. A parallel cluster search system is used to evaluate clusters in parallel. The system also includes a parallel classifier for classifying further collected resources.

FIELD OF THE INVENTION

The present invention relates to taxonomies for information resources, and in particular to a system and process for generating a taxonomy for a plurality of information resources in a communications network.

BACKGROUND

The enormous number of stored electronic documents and other information resources available in modern communications networks such as the Internet poses particular problems for classification and categorisation. For example, the world wide web provides access to an ever-increasing number of electronic documents, many of them generated dynamically, and it is often difficult to retrieve a document of interest without knowing in advance at least part of an identifier, address or locator for the resource. For this reason, search engines have been developed which attempt to generate lists of relevant documents in response to keywords typed in by a user. However, such searches are limited by the choice of keywords entered by the user. As an alternative, directories of web resources have been created by manual vetting and categorisation of web documents into hierarchical category structures known as web directories. These directories are extremely useful for locating relevant documents once a particular category has been chosen. However, the development of these directories is a challenge in itself. For example, companies such as Yahoo! have employed more than 300 people for maintaining the structure of their online directory. This level of expenditure is not justifiable for most companies. More recently, some solutions have appeared which replace the manual vetting with automatic classification based on a manually created taxonomy. Although this alleviates the problem to some extent, the manpower needed to create and maintain the appropriate taxonomy is still considerable. It is desired, therefore, to provide an improved system and process for generating a taxonomy for information resources in a communications network, or at least a useful alternative.

SUMMARY OF THE INVENTION

In accordance with the present invention there is provided a process for generating a taxonomy for a plurality of information resources in a communications network, including:

-   collecting said resources from said network;
-   generating cluster criteria from said resources; and
-   generating said taxonomy as a hierarchy of resource clusters based on said criteria, wherein the number of said resource clusters is determined by content of said resources.

The present invention also provides an information resource taxonomy system, including

-   a data collector for collecting information resources from a communications network; and
-   a taxonomy generator for generating a taxonomy represented by a hierarchy of resource clusters, using cluster criteria generated from said resources, wherein the number of said resource clusters is determined by content of said resources.

BRIEF DESCRIPTION OF THE DRAWINGS

Preferred embodiments of the present invention are hereinafter described, by way of example only, with reference to the accompanying drawings, wherein:

FIG. 1 is a schematic diagram of a preferred embodiment of an information resource taxonomy system;

FIG. 2 is a flow diagram of a data collection process executed by a data collector of the system;

FIG. 3 is a flow diagram of a pre-processing process executed by a pre-processor of the system; and

FIG. 4 is a graph of the goodness value of a document set as a function of the cluster threshold.

DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENT

As shown in FIG. 1, an information resource taxonomy system includes a data collector 10, a data processing system 12, a renderer 14, and a management system 16. The taxonomy system executes a taxonomy generation process that automatically generates a taxonomy from structured or unstructured documents or other information resources, and can be used to maintain the taxonomy. The taxonomy is a hierarchical tree structure that organizes resources into clusters or nodes based on their similarity, and can include the resources themselves. The taxonomy is subsequently used by the renderer 14 to generate markup code such as HTML, XML, or ASP that provides an interactive, hierarchical view into the space of documents or other information resources. A user of the Internet can view the hierarchy and open individual documents or other information resources over the Internet using a web browser 32 to access the markup code generated by the renderer 14 and generate a graphical display of the hierarchy. The taxonomy system can be applied to a variety of taxonomy generation tasks such as site management of corporate intranets and external web sites.

An administrator of the taxonomy system can log in to the system from a terminal associated with the management system 16. The administrator can then submit to the taxonomy system a text file that defines the taxonomy specifications, i.e., the taxonomy creation tasks to be performed by the system. This file includes a list of uniform resource identifiers (URIs) and a corresponding list of ‘include’ specifications. The URIs indicate high-level domains that are to be clustered or categorised by the taxonomy system, and the ‘include’ specifications indicate the types of documents that are to be included in the taxonomy. For example, it may be desired to include only textual documents in one or more of the following formats: HTML, text, Microsoft Word®, FrameMaker, and StarOffice. The text file containing these specifications is sent to the data collector 10.

The components of the taxonomy system can be implemented using standard computer system hardware and adding unique software modules. For example, the data collector 10 and the renderer 14 are 850 MHz Pentium 3 and 1.5 GHz Pentium 4 personal computers, respectively, each running a Linux operating system. The data processing system 12 is a Sun Ultra Enterprise four-CPU server running a Solaris 8 operating system. The management system 16 is a 1.5 GHz Pentium 4 personal computer running a Windows XP operating system. The data processing system 12 includes a number of data processing modules 18 to 26, including a pre-processor 18, a sampler 20, a clusterer and classifier 22, a taxonomy database 24, and a post processor module 26. The data processing system 12 can further include parallel clusterers 28, and/or parallel classifiers 30. The renderer 14 includes a taxonomy rendering module 15 and a web server module 17. The management system 16 includes a process management component 19 and an editor module 21. Whilst these modules are preferably implemented by software code, at least some of the processing steps executed by the modules, described below, may be implemented by hardware circuits such as application-specific integrated circuits (ASICs).

The data collector 10 executes a data collection process, as shown in FIG. 2. The data collection process begins at step 34 when the taxonomy specifications are received. The collector 10 uses the specifications to navigate or “crawl” the Internet at step 36, starting at the top level domains provided by the URI lists and progressing down to sub-domains thereof. The crawling process is known in the art. Briefly, the data collector 10 performs HTTP GET requests to network servers indicated by the provided URIs, or by links within HTML data previously retrieved from the network, including only those links that match the include specifications. For each document retrieved, the data collector 10 converts any documents that are not in HTML into HTML at step 38. The resulting HTML data is then sent to the data processing system 12 at step 40. If the data collector 10 has exhausted all of the hyperlinks contained within documents retrieved from the network, then the process branches at step 42 to return to step 34, and waits for the next category specification to be submitted by an administrator. Alternatively, if it is determined at step 42 that more data needs to be collected, the process branches back to step 36 in order to retrieve more data from the network.
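
By way of illustration only, the crawl loop of steps 36 to 42 might be sketched as follows. This is a minimal sketch under stated assumptions, not the data collector's actual code: the names `crawl`, `LinkExtractor` and `fetch` are hypothetical, the include specifications are assumed to be shell-style wildcard patterns, and the HTTP GET request is abstracted behind a caller-supplied `fetch` function so that the sketch runs without network access. Conversion of non-HTML documents (step 38) is omitted.

```python
from collections import deque
from fnmatch import fnmatchcase
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collects the href targets of anchor tags found in an HTML page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seed_uris, include_patterns, fetch):
    """Breadth-first crawl starting from the seed URIs.

    fetch(uri) is a hypothetical caller-supplied function that returns the
    HTML text for a URI, or None if nothing could be retrieved.
    Seed URIs are always fetched; discovered links are followed only if
    they lie below a seed URI and match an include pattern (an assumption).
    """
    queue = deque(seed_uris)
    seen = set(seed_uris)
    pages = {}
    while queue:  # an empty queue means all hyperlinks are exhausted
        uri = queue.popleft()
        html = fetch(uri)
        if html is None:
            continue
        pages[uri] = html
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            target = urljoin(uri, href)
            if target in seen:
                continue
            if not any(target.startswith(seed) for seed in seed_uris):
                continue  # stay below the top-level domains supplied
            if not any(fnmatchcase(target, p) for p in include_patterns):
                continue  # not a document type listed for inclusion
            seen.add(target)
            queue.append(target)
    return pages


# Usage with an in-memory stand-in for the network.
site = {
    "http://example.com/": '<a href="a.html">A</a><a href="b.pdf">B</a>'
                           '<a href="http://other.org/x.html">X</a>',
    "http://example.com/a.html": '<a href="/">home</a>',
}
pages = crawl(["http://example.com/"], ["*.html"], site.get)
print(sorted(pages))
```

In this usage the PDF link and the link to the other host are both skipped, so only the seed page and `a.html` are collected.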

HTML data sent to the data processing system 12 from the data collector 10 is received by the pre-processor 18. Alternatively, HTML data can be directly submitted to the pre-processor 18 by the administrator using the management system 16. The pre-processor 18 executes an HTML processing process, as shown in FIG. 3. The process begins when HTML data is received by the pre-processor 18 at step 44. Metadata tags are then extracted from the HTML data at step 46. This is achieved by regular expression matching on predefined patterns such as the HTML tags <TITLE>, <META . . . > and so on. Meta information is included in the output from the pre-processor 18 as text-delimited additions to the data. The delimiters are text markups that do not normally occur in the data, e.g., “xxxxxxxx:”. The remaining data is then processed at step 48 by a filter that removes data that is not considered to be important. This includes removing text that appears likely to be a component of an advertising table or banner. Commonly occurring noise strings are removed by stoplists or by statistical analysis. For example, noise reduction can be achieved by building a frequency table of strings found in the document set. These strings are the characters found between matching pairs of HTML tags, such as <TD> and </TD>. A string is removed from the document set if its occurrence frequency exceeds a threshold value. At step 50, the pre-processor 18 converts the remaining HTML to text by removing HTML tags. The resulting text document is then sent to the sampler 20 at step 52. The sampler 20 samples a fixed fraction of incoming documents, as described below. The sample documents are then processed by the clusterer/classifier 22.
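
The pre-processing steps 46 to 50 might be sketched as follows. This is an illustrative sketch only: the function names, the delimiter string `xxxxtitlexxxx:` and the choice to measure occurrence frequency as the fraction of documents containing a string are all assumptions, and only the <TITLE> and <TD> patterns are handled.

```python
import re
from collections import Counter

TITLE_RE = re.compile(r"<title[^>]*>(.*?)</title>", re.IGNORECASE | re.DOTALL)
CELL_RE = re.compile(r"<td[^>]*>(.*?)</td>", re.IGNORECASE | re.DOTALL)
TAG_RE = re.compile(r"<[^>]+>")


def extract_title(html):
    """Step 46: pull the title out by regular expression matching."""
    match = TITLE_RE.search(html)
    return match.group(1).strip() if match else ""


def noisy_strings(documents, max_fraction):
    """Build a frequency table of <td> contents and return the strings
    that occur in more than max_fraction of the documents (assumed rule)."""
    counts = Counter()
    for html in documents:
        cells = {cell.strip() for cell in CELL_RE.findall(html)}
        counts.update(cell for cell in cells if cell)
    limit = max_fraction * len(documents)
    return {text for text, count in counts.items() if count > limit}


def drop_noise_cells(html, noise):
    """Step 48: remove table cells whose content is a known noise string."""
    return CELL_RE.sub(
        lambda m: "" if m.group(1).strip() in noise else m.group(0), html
    )


def preprocess(documents, max_fraction=0.5, delimiter="xxxxtitlexxxx:"):
    noise = noisy_strings(documents, max_fraction)
    results = []
    for html in documents:
        title = extract_title(html)
        body = drop_noise_cells(TITLE_RE.sub(" ", html), noise)
        # Step 50: strip the remaining tags and normalise whitespace.
        text = " ".join(TAG_RE.sub(" ", body).split())
        results.append(f"{delimiter} {title}\n{text}")
    return results


docs = [
    "<html><title>Alpha</title><table><tr><td>Buy now!</td>"
    "<td>alpha content</td></tr></table></html>",
    "<html><title>Beta</title><table><tr><td>Buy now!</td>"
    "<td>beta content</td></tr></table></html>",
    "<html><title>Gamma</title><table><tr><td>Buy now!</td>"
    "<td>gamma content</td></tr></table></html>",
]
for text in preprocess(docs):
    print(text)
```

Here the banner text appears in every document and is removed, while the cell text that is unique to each document survives.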

The clusterer 22 partitions the documents based on their content. It does this by forming groups or clusters of documents based on their natural affinity rather than requiring a pre-specified number of categories. The clustering and feature selection processes are based upon processes described in the specification of International Patent Application No. PCT/AU01/01198 (“the TACT specification”), incorporated herein by reference. First, each document is represented by a word frequency vector including words from the document and their frequencies of occurrence, where some words are excluded using feature selection criteria. A numeric similarity measure is then determined as a function of any two word vectors to determine the similarity of any two documents. For example, a new cluster can be formed by two documents if their similarity falls within a threshold similarity value for clustering. Once formed, a cluster is characterized by a word frequency vector that is the average of the word frequency vectors of its constituent documents. This average word frequency vector is referred to as the cluster centroid. The similarity measure used is the cosine similarity function, described in the TACT specification. The clustering process uses this similarity measure to group similar documents into clusters by assigning each document to the most similar cluster. An optimal similarity threshold value for creating clusters from a given document set is determined by creating different groupings of the documents at different thresholds and then evaluating these to determine the best grouping, as described in An Evaluation of Criteria for Measuring the Quality of Clusters by B. Raskutti and C. Leckie, pp. 905-910, in Proceedings of the Sixteenth International Joint Conference on Artificial Intelligence, 1999. This evaluation is based on minimising a goodness value that is based on the similarity of documents within clusters, which tends to reduce the number of documents in each cluster, and the separation of cluster centroids from the global centroid, which encourages larger clusters. For example, a goodness value for a document set can be determined by simply summing these two values.
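
The role of the word frequency vectors, the cosine similarity and the cluster centroid can be illustrated with a simple single-pass clustering sketch. This is not the process of the TACT specification, which is not reproduced here; it is a simplified stand-in. It assumes that a higher cosine value means greater similarity and that a document joins the most similar cluster when that similarity is at or above the threshold, which may differ from the threshold convention used elsewhere in this description. Feature selection is omitted, and all function names are hypothetical.

```python
import math
from collections import Counter


def word_vector(text):
    """Word frequency vector of a document (no feature selection)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity of two sparse word frequency vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def centroid(vectors):
    """Average of the word frequency vectors of a cluster's documents."""
    total = Counter()
    for v in vectors:
        total.update(v)
    return {w: c / len(vectors) for w, c in total.items()}


def cluster(vectors, threshold):
    """Assign each document to its most similar cluster, or start a new
    cluster if no centroid is similar enough. The number of clusters is
    therefore decided by the content, not fixed in advance."""
    clusters = []   # lists of document indices
    centroids = []  # one centroid per cluster
    for i, v in enumerate(vectors):
        best, best_sim = None, threshold
        for k, c in enumerate(centroids):
            sim = cosine(v, c)
            if sim >= best_sim:
                best, best_sim = k, sim
        if best is None:
            clusters.append([i])
            centroids.append(dict(v))
        else:
            clusters[best].append(i)
            centroids[best] = centroid([vectors[j] for j in clusters[best]])
    return clusters


texts = [
    "cat dog cat",
    "dog cat",
    "stock market shares",
    "market shares stock price",
]
vectors = [word_vector(t) for t in texts]
print(cluster(vectors, threshold=0.5))
```

With a threshold of 0.5 the two pet documents form one cluster and the two finance documents form another; a threshold above any pairwise similarity would leave every document in a cluster of its own.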

Hierarchical clustering is achieved by iterative clustering of larger, less coherent clusters. The coherence of a cluster is determined by the intra-cluster similarity value of the cluster. If the documents in a cluster are very similar, i.e., the similarity values of each document with the cluster centroid fall within a similarity threshold for coherence, then the cluster is deemed coherent. If this criterion is not met, then documents within the cluster are formed into sub-clusters of the original cluster. These sub-clusters are sub-nodes of the original parent cluster or node, thus forming a hierarchy of clusters or nodes. By performing this sub-clustering iteratively, a hierarchical tree structure of coherent clusters is formed, to provide the taxonomy. The computational complexity of this clustering process is proportional to n, the number of documents; K, the number of threshold evaluations; and m, the average number of clusters per threshold.
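
The iterative sub-clustering can be pictured as a recursion that stops at coherent clusters. The sketch below shows only that control flow. To keep it short, documents are replaced by plain numbers, coherence is a hypothetical "range no greater than 1" test, and the split is a two-way cut at the largest gap; in the system described, the coherence test would use similarity to the cluster centroid and the number of sub-clusters would follow from the content.

```python
def build_hierarchy(items, is_coherent, split):
    """Recursively split incoherent clusters into sub-clusters.

    Returns a tree of nodes of the form {"items": [...], "children": [...]}.
    is_coherent and split are caller-supplied (hypothetical) functions.
    """
    node = {"items": list(items), "children": []}
    if len(node["items"]) <= 1 or is_coherent(node["items"]):
        return node  # coherent clusters become leaves
    groups = split(node["items"])
    if len(groups) <= 1:
        return node  # no further split possible; avoid endless recursion
    node["children"] = [build_hierarchy(g, is_coherent, split) for g in groups]
    return node


def leaves(node):
    """Collect the coherent leaf clusters from left to right."""
    if not node["children"]:
        return [node["items"]]
    return [leaf for child in node["children"] for leaf in leaves(child)]


# Toy stand-ins for the coherence test and the clustering step.
def is_tight(values):
    return max(values) - min(values) <= 1.0


def split_at_largest_gap(values):
    ordered = sorted(values)
    gaps = [b - a for a, b in zip(ordered, ordered[1:])]
    cut = gaps.index(max(gaps)) + 1
    return [ordered[:cut], ordered[cut:]]


tree = build_hierarchy(
    [1.0, 1.5, 10.0, 10.2, 20.0, 30.0], is_tight, split_at_largest_gap
)
print(leaves(tree))
```

The depth of the resulting tree is not fixed in advance; it depends on how many rounds of splitting are needed before every leaf passes the coherence test.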

The clustering process includes several steps for alleviating some of the scalability issues by reducing n and K. Whilst m is much smaller than n, it is proportional to n, therefore reducing n also reduces m. In one form, execution time is reduced by using percentage-based random sampled clustering of the document space whereby the sampler 20 provides a fixed fraction of the document space to the clusterer 22 for clustering.

A second form is provided by stopping the clustering process after a predefined time interval in order to generate a clustered sample of the document space. These two forms of optimisation can be used independently or in conjunction.
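
The percentage-based random sampling performed by the sampler 20 might look like the following sketch. The function name, the `seed` parameter and the rule of always keeping at least one document are assumptions made for illustration.

```python
import random


def sample_documents(documents, fraction, seed=None):
    """Split documents into a random sample of the given fraction and
    the remainder, preserving the original order within each part."""
    if not documents:
        return [], []
    rng = random.Random(seed)
    size = max(1, round(fraction * len(documents)))
    chosen = set(rng.sample(range(len(documents)), size))
    sample = [d for i, d in enumerate(documents) if i in chosen]
    rest = [d for i, d in enumerate(documents) if i not in chosen]
    return sample, rest


docs = [f"doc-{i}" for i in range(100)]
sample, rest = sample_documents(docs, fraction=0.1, seed=0)
print(len(sample), len(rest))
```

The sample would be passed to the clusterer, and the remainder held back for later assignment to the resulting clusters.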

After the initial clustering has been performed on a subset or sample of the document set, the remaining documents are subsequently assigned to the clusters by one of three processes. The first process simply classifies documents into the existing clusters using the existing cluster centroids. That is, a new document is added to an existing cluster if its similarity to the cluster centroid falls within a fixed threshold similarity value. Any documents failing the threshold evaluation criteria for all clusters are set aside for later clustering.
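
The first of these processes, classification against existing centroids, might be sketched as follows. The sketch assumes cosine similarity with "at or above the threshold" as the acceptance rule; the function names and the sample centroids are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity of two sparse word frequency vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def assign(documents, centroids, threshold):
    """Add each document to its most similar existing cluster, or set it
    aside for later clustering if no centroid is similar enough."""
    assigned = {name: [] for name in centroids}
    set_aside = []
    for doc_id, vector in documents.items():
        best = max(centroids, key=lambda name: cosine(vector, centroids[name]))
        if cosine(vector, centroids[best]) >= threshold:
            assigned[best].append(doc_id)
        else:
            set_aside.append(doc_id)
    return assigned, set_aside


centroids = {
    "pets": {"cat": 1.5, "dog": 1.0},
    "finance": {"stock": 1.0, "market": 1.0},
}
documents = {
    "d1": {"cat": 1},
    "d2": {"market": 2, "stock": 1},
    "d3": {"weather": 1},
}
assigned, set_aside = assign(documents, centroids, threshold=0.5)
print(assigned, set_aside)
```

Note that the centroids are left unchanged here; updating them as documents are added is what gives rise to the centroid drift discussed below.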

The second process uses the sample document clusters as a training set for an alternative document classification system. In this case, a support vector machine (SVM) is used as an alternative classifier. The SVM is described in the specification of International Patent Application No. PCT/AU01/00415, incorporated herein by reference. As with the first process, any documents not classified are set aside for later clustering.

The third process simply continues to cluster, but using the optimal threshold similarity value determined whilst clustering the initial sample documents. This process forms new clusters for new documents that are not similar to the existing clusters.

Each of these three processes is an approximation and assumes that the original sample is representative of the complete (or future) document space. Consequently, errors are introduced over time as more documents are added to the clusters due to cluster centroid drift. Two processes are used to combat this effect. In the first process, the coherence of the clusters is maintained as the number of documents n increases by reducing the similarity threshold with increasing n.

In the second process, a new random sample better representing the population is determined as the document collection grows. The new sample is used as a metric for evaluating the optimality of the existing clusters and/or as a means for determining a new quasi-optimal similarity threshold value for subsequent re-clustering of the document space to improve accuracy.

To reduce the time required by the search for an optimal or quasi-optimal similarity threshold value, cluster formation with different threshold values can be performed by different threads or on different processors in an SMP or distributed processing framework, such as the parallel clusterers 28. The time spent searching for the optimum threshold value is also reduced by using an efficient search process based on knowledge of the topography of the goodness vs threshold similarity curve. For example, FIG. 4 is a graph of the goodness value of a document set, as described above, as a function of the logarithm of the similarity threshold value for cluster formation. The solid line 54 joining data points has a well defined minimum 56 at a log (threshold) value near 0.2. The general shape of this graph is typical of all document sets. Knowing the approximate shape of this graph allows the optimal threshold value for a particular document set to be located rapidly.
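
Both ideas, parallel evaluation of candidate thresholds and a search that exploits the single-minimum shape of the curve, can be sketched as follows. The search method is not named in this description; golden-section search is used here as one standard choice for a curve with a single minimum. The `toy_goodness` function is a stand-in with a minimum at 0.2, not a real goodness evaluation, and all function names are hypothetical. In CPython, threads suit evaluations that release the interpreter lock; a process pool could be substituted for CPU-bound clustering.

```python
import math
from concurrent.futures import ThreadPoolExecutor


def best_of_candidates(goodness, candidates, max_workers=4):
    """Evaluate goodness at every candidate threshold in parallel and
    return the (threshold, goodness) pair with the lowest goodness."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        scores = list(pool.map(goodness, candidates))
    index = min(range(len(candidates)), key=scores.__getitem__)
    return candidates[index], scores[index]


def golden_section_minimum(goodness, low, high, tolerance=1e-3):
    """Locate the minimum of a curve with a single minimum on [low, high],
    reusing one of the two interior evaluations at every step."""
    ratio = (math.sqrt(5) - 1) / 2
    a, b = low, high
    c = b - ratio * (b - a)
    d = a + ratio * (b - a)
    fc, fd = goodness(c), goodness(d)
    while b - a > tolerance:
        if fc < fd:
            b, d, fd = d, c, fc
            c = b - ratio * (b - a)
            fc = goodness(c)
        else:
            a, c, fc = c, d, fd
            d = a + ratio * (b - a)
            fd = goodness(d)
    return (a + b) / 2


def toy_goodness(log_threshold):
    """Stand-in curve with a single minimum at log(threshold) = 0.2."""
    return (log_threshold - 0.2) ** 2 + 1.0


print(best_of_candidates(toy_goodness, [-1.0, -0.5, 0.0, 0.2, 0.5, 1.0]))
print(round(golden_section_minimum(toy_goodness, -1.0, 1.0), 2))
```

A coarse parallel sweep can be used to bracket the minimum, with the sequential search then refining it inside the bracket.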

The taxonomy produced by clustering is stored in the taxonomy database 24. After clustering, the postprocessor module 26 augments the clustered data by extracting titles from metadata of each document, and adding summary text generated by the clustering process, as described in the TACT specification. In cases where access logs (i.e., web server or proxy cache logs) are available for each document, the clusters and/or documents within each cluster can be ranked using the access frequency of each document. For example, on a corporate web server, the most popular pages are listed near the top of each category listing, and/or the most popular categories are listed near the top of a listing of categories.
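
Ranking by access frequency might be sketched as follows, assuming the access log has already been reduced to a list of requested document paths and that a category's popularity is the total number of hits on its documents. Both assumptions, and the function name, are illustrative only.

```python
from collections import Counter


def rank_by_access(taxonomy, accessed):
    """Order categories and the documents inside each category by hits.

    taxonomy maps a category name to its list of document paths;
    accessed is a list of document paths taken from an access log.
    """
    hits = Counter(accessed)
    ranked_documents = {
        category: sorted(docs, key=lambda path: -hits[path])
        for category, docs in taxonomy.items()
    }
    ranked_categories = sorted(
        taxonomy, key=lambda c: -sum(hits[path] for path in taxonomy[c])
    )
    return ranked_categories, ranked_documents


taxonomy = {"hr": ["/hr/leave", "/hr/pay"], "it": ["/it/vpn"]}
log = ["/hr/pay", "/it/vpn", "/it/vpn", "/hr/pay",
       "/it/vpn", "/hr/leave", "/it/vpn"]
categories, documents = rank_by_access(taxonomy, log)
print(categories)
print(documents["hr"])
```

Because Python's sort is stable, categories or documents with equal hit counts keep their original relative order.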

The management system 16 includes an editor 21 that allows the administrator to manually edit a taxonomy to create a new document hierarchy. This new structure can then be used as the training set for adding further documents to the database using the classifier function of the clusterer/classifier 22. The speed of document classification by the categorisation system can be improved by using the parallel classifiers 30 to classify many documents in parallel.

The editor 21 offers a number of editing functions, including moving branches of the hierarchical taxonomy to other branches, editing meta descriptions for documents and branches, and creating, deleting, and merging new branches in the taxonomy. The editor 21 presents information from the taxonomy database 24 using HTML forms. Changes can then be made to the taxonomy by modifying input fields in the forms and then submitting the changes via submit buttons of the forms.

The taxonomy rendering module 15 of the renderer 14 generates dynamic web pages using the taxonomy database 24 to provide structure to the original resource content. These web pages can be accessed by providing to the web browser 32 a URI associated with the web server module 17. The visual presentation provided by these web pages is derived from a configuration file detailing the arrangement of the various fields on the rendered page. The pages represent a web ‘view’ into the hierarchy using a ‘directory’ style wherein the URI of the displayed page corresponds to the position or branch within the taxonomy that is being browsed. Each level in the ‘view’ can contain documents and/or categories, i.e., deeper branches in the taxonomy. Browsing into a category produces a new view with a greater level of specificity. Each branch in the taxonomy is initially labelled automatically by extracting descriptive information from the data during taxonomy generation, as described above, and is manually editable by invoking the editor module 21 of the management system 16. Documents are presented using their titles and summaries. Browsing to the document opens the document or a representation of the document.
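
The ‘directory’ style correspondence between a URI path and a branch of the taxonomy might be sketched as follows. The nested-dictionary layout, the function name `resolve_view` and the sample category names are hypothetical; the actual storage format of the taxonomy database 24 is not described here.

```python
def resolve_view(taxonomy, uri_path):
    """Walk the taxonomy along the segments of a URI path and return the
    sub-categories and documents visible at that branch."""
    node = taxonomy
    for segment in filter(None, uri_path.split("/")):
        node = node["categories"][segment]  # KeyError for an unknown branch
    return sorted(node["categories"]), list(node["documents"])


taxonomy = {
    "categories": {
        "products": {
            "categories": {
                "routers": {"categories": {}, "documents": ["Router FAQ"]},
            },
            "documents": ["Product overview"],
        },
        "support": {"categories": {}, "documents": ["Contact us"]},
    },
    "documents": [],
}

print(resolve_view(taxonomy, "/"))
print(resolve_view(taxonomy, "/products/"))
print(resolve_view(taxonomy, "/products/routers"))
```

Each additional path segment narrows the view to a deeper, more specific branch, mirroring the browsing behaviour described above.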

Many modifications will be apparent to those skilled in the art without departing from the scope of the present invention as herein described with reference to the accompanying drawings.

The invention claimed is:
 1. A process for generating a taxonomy for a plurality of information resources in a communications network, including: (i) collecting said plurality of information resources from said communications network; (ii) generating clusters of said plurality of collected information resources on the basis of a similarity threshold value for clustering and similarity values for said plurality of collected information resources; (iii) iteratively generating sub-clusters of said generated clusters based on the similarity threshold value for clustering and similarity values for information resources within each of said generated clusters and within each of said generated sub-clusters, wherein the generated clusters and sub-clusters provide a hierarchy of resource clusters, wherein the number of resource clusters at each level of said hierarchy is determined by content of said plurality of collected information resources; (iv) collecting further information resources from said communications network; (v) assigning the further collected information resources to a plurality of the resource clusters; (vi) maintaining the coherence of the plurality of resource clusters as further collected information resources are assigned by at least one of: (a) reducing the similarity threshold value for clustering with an increasing number of the further collected information resources; and (b) selecting a random subset of resources from the collected information resources; generating a new similarity threshold value for clustering based on the selected random subset of resources; and re-clustering the collected information resources using the generated new similarity threshold value for clustering; and (vii) repeating the steps of collecting further information resources, reducing similarity and maintaining the coherence.
 2. The process as claimed in claim 1, wherein said step of iteratively generating includes selecting one or more of said generated clusters if one or more intra-cluster similarity values for respective resources of said generated clusters exceeds corresponding similarity threshold value, and iteratively generating sub-clusters of the selected one or more of said generated clusters.
 3. The process as claimed in claim 1, wherein said step of generating clusters includes: selecting one or more portions of said plurality of information resources on the basis of respective metrics of the relevance of said one or more portions, wherein the number of said generated clusters is determined on the basis of the selected one or more portions of said plurality of information resources.
 4. The process as claimed in claim 3, wherein clusters are generated on the basis of similarity values generated from word frequency data generated from one or more selected portions of said plurality of information resources.
 5. The process as claimed in claim 1, including generating linked document data for displaying said hierarchy of resource clusters.
 6. The process as claimed in claim 5, wherein said linked document data includes markup language data.
 7. The process as claimed in claim 5, wherein said linked document data includes metadata of said plurality of information resources.
 8. The process as claimed in claim 1, including generating descriptive text for said plurality of information resources and descriptive text for each resource cluster of said hierarchy.
 9. The process as claimed in claim 1, wherein said plurality of information resources include dynamically generated content of said communications network.
 10. The process as claimed in claim 1, wherein components of said hierarchy are sorted based on access frequencies of said plurality of collected information resources.
 11. The process as claimed in claim 1, wherein a resource is added to an existing cluster if the similarity of said resource to said cluster meets a similarity requirement.
 12. The process as claimed in claim 1, wherein a new cluster is generated if the similarity of resource to each existing cluster does not meet a similarity requirement.
 13. The process as claimed in claim 1, wherein said step of generating clusters includes determining the similarity threshold value for clustering on the basis of goodness values for respective groupings of said plurality of collected information resources generated for respective similarity threshold values.
 14. The process as claimed in claim 13, wherein the goodness value for each grouping is generated on the basis of similarity values for resources within the clusters of the grouping and differences between cluster centroids for the clusters of the grouping and a global centroid for said resources.
 15. The process as claimed in claim 1, wherein said steps of generating clusters and iteratively generating sub-clusters are scalable with the number of said plurality of collected information resources.
 16. The process as claimed in claim 1, wherein the plurality of information resources collected at step (i) are a selected subset of a larger set of resources including the further information resources collected at step (iv).
 17. The process as claimed in claim 16, including selecting said subset by random sampling of said larger set of resources.
 18. The process as claimed in claim 16, including classifying resources into said resource clusters.
 19. The process as claimed in claim 16, including using said clusters generated at steps (ii) and (iii) as a training set for a classifier used at step (v) to assign the further information resources to at least one of the resource clusters.
 20. The process as claimed in claim 19, wherein said classifier includes a support vector machine.
 21. The process as claimed in claim 16, including clustering using a similarity value determined whilst clustering the subset of said larger set of resources.
 22. The process as claimed in claim 21, including maintaining the coherence of clusters as the further additional resources are collected by reducing said similarity threshold value with increasing number of resources.
 23. The process as claimed in claim 22, including determining a new similarity value for reclustering existing clusters on the basis of the quality of said existing clusters.
 24. The process as claimed in claim 16, including maintaining generating a new hierarchy of resource clusters and classifying unselected resources into said new hierarchy of resource clusters.
 25. The process as claimed in claim 16, including selecting a subset of clustered resources as the number of clustered resources increases to generate a metric for evaluating the quality of existing clusters.
 26. The process as claimed in claim 1, including generating one or more new clusters for resources that are not substantially similar to existing clusters.
 27. The process as claimed in claim 1, wherein said step of iteratively generating includes selecting one or more of said generated clusters on the basis of respective measures of coherence of said generated clusters, and iteratively generating sub-clusters of the selected one or more of said generated clusters.
 28. An information resource taxonomy system having computer system hardware components, including at least one processor operating according to one or more software modules, the at least one processor and software modules configured to execute the steps of claim 1.
 29. A computer-readable storage medium, having stored thereon program code for executing the steps of claim 1.
 30. The process as claimed in claim 1, wherein the content of resources in each cluster determines the number of sub-clusters generated from the cluster in the immediately inferior level of the hierarchy.
 31. The process as claimed in claim 30, wherein the number of levels of said hierarchy is determined by content of said plurality of information resources.
 32. The process as claimed in claim 1 wherein the number of levels of said hierarchy is determined by content of said plurality of information resources.
 33. An information resource taxonomy system, including: a data collector having computer system hardware components including at least one processor operating according to one or more software modules, the at least one processor and one or more software modules configured for collecting information resources from a communications network; a taxonomy generator for generating clusters of said collected information resources based on a similarity threshold value for clustering and similarity values for said collected information resources and for iteratively generating sub-clusters of said generated clusters based on the similarity threshold value for clustering and similarity values for information resources within each of said generated clusters and within each of said generated sub-clusters, wherein the generated clusters and sub-clusters provide a hierarchy of resource clusters, wherein the number of resource clusters in each level of said hierarchy is determined by content of said collected information resources; a classifier configured to classify further information resources collected from the communication network to a plurality of the resource clusters; and a component configured to maintain the coherence of the plurality of resource clusters as further information resources are classified by at least one of: (a) reducing the similarity threshold value for clustering with increasing numbers of the further collected information resources; and (b) selecting a random subset of information resources from the collected information resources; generating a new similarity threshold value for the selected random subset of information resource; and re-clustering the collected information resources using the new similarity threshold value for clustering.
 34. The system as claimed in claim 33, including an editor for editing said taxonomy hierarchy of resource clusters, and a renderer for generating linked document data for displaying said hierarchy of resource clusters.
 35. The system as claimed in claim 33, wherein said system is scalable with respect to the number of said information resources.
 36. The system as claimed in claim 33, including a parallel cluster search system for evaluating clusters in parallel.
 37. The system as claimed in claim 33, wherein the classifier is including a parallel classifier for classifying further resources in parallel.
 38. The system as claimed in claim 33, wherein the number resource clusters in each level of said hierarchy is determined by content of resources, each resource cluster determines the number of sub-clusters generated from the resource cluster in the immediately inferior level of the hierarchy.
 39. The system as claimed in claim 38, wherein the number of levels of said hierarchy is determined by content of said information resources.
 40. The system as claimed in claim 33, wherein the number of levels of said hierarchy is determined by content of said information resources.